Multiword Expressions Dataset for Indian Languages
Abstract
Multiword Expressions (MWEs) are used frequently in natural languages, but understanding the diversity of MWEs is one of the open problems in Natural Language Processing. In the context of Indian languages, MWEs play an important role. In this paper, we present an MWE annotation dataset created for two Indian languages, Hindi and Marathi. We extract possible MWE candidates from two repositories: 1) a POS-tagged corpus and 2) the IndoWordNet synsets. Annotation covers two types of MWEs: compound nouns and light verb constructions. During annotation, human annotators tag the valid MWEs among these candidates, following the standard guidelines provided to them. Using the two repositories, we obtained 3178 compound nouns and 2556 light verb constructions in Hindi, and 1003 compound nouns and 2416 light verb constructions in Marathi. The resulting resource is publicly available and can be used as a gold standard for Hindi and Marathi MWE systems.
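The candidate-extraction step from a POS-tagged corpus can be illustrated with a minimal sketch. The abstract does not specify the patterns or tagset used, so everything here is an assumption made for illustration: the tag names (`NN`, `NNP`, `VM`), the adjacent noun+noun and noun+verb patterns, the function name `extract_candidates`, and the transliterated sample tokens. The sketch deliberately over-generates, since in the described workflow human annotators decide which candidates are valid MWEs.

```python
from collections import Counter
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (token, POS tag) pairs

# Hypothetical tag names; the tagset of the actual corpus may differ.
NOUN_TAGS = {"NN", "NNP"}
VERB_TAGS = {"VM"}


def extract_candidates(
    sentences: Iterable[TaggedSentence],
) -> Tuple[Counter, Counter]:
    """Collect adjacent noun+noun and noun+verb pairs as MWE candidates.

    Returns two counters: compound-noun candidates and light-verb-construction
    candidates. These are only candidates; a human annotator would still
    accept or reject each one.
    """
    compound_nouns: Counter = Counter()
    light_verbs: Counter = Counter()
    for sentence in sentences:
        # Walk over every pair of neighbouring tokens in the sentence.
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if t1 in NOUN_TAGS and t2 in NOUN_TAGS:
                compound_nouns[(w1, w2)] += 1
            elif t1 in NOUN_TAGS and t2 in VERB_TAGS:
                light_verbs[(w1, w2)] += 1
    return compound_nouns, light_verbs


# Illustrative, transliterated sample input (not taken from the dataset).
sample = [
    [("rel", "NN"), ("gaadi", "NN"), ("ruki", "VM")],
    [("usne", "PRP"), ("madad", "NN"), ("ki", "VM")],
]
compound_candidates, lvc_candidates = extract_candidates(sample)
```

In this sample, the pair `("gaadi", "ruki")` is collected as a light-verb candidate even though it is an ordinary noun followed by a main verb, which is the kind of case the manual annotation step is meant to filter out.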
Similar resources
A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
Multiword expressions are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions represent a subgroup of multiword expressions, namely that in which a verb is the syntactic head of the group considered in its canonical (or dictionary) form. All multiword expressions are a great challenge for natural language processing, but the ...
Can Recognising Multiword Expressions Improve Shallow Parsing?
There is significant evidence in the literature that integrating knowledge about multiword expressions can improve shallow parsing accuracy. We present an experimental study to quantify this improvement, focusing on compound nominals, proper names and adjective-noun constructions. The evaluation set of multiword expressions is derived from WordNet and the textual data are downloaded from the web...
Modeling the Statistical Idiosyncrasy of Multiword Expressions
The focus of this work is statistical idiosyncrasy (or collocational weight) as a discriminant property of multiword expressions. We formalize and model this property, compile a 2-class dataset of MWE and non-MWE examples, and evaluate our models on this dataset. We present a possible empirical implementation of collocational weight and study its effects on identification and extraction of MWEs...
A Comprehensive Dictionary of Multiword Expressions
It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as...
USzeged: Identifying Verbal Multiword Expressions with POS Tagging and Parsing Techniques
The paper describes our system submitted for the Workshop on PARSEME’s Shared Task on automatic identification of verbal multiword expressions. It uses POS tagging and dependency parsing to identify single- and multi-token verbal MWEs in text. Our system is language-independent and competed on nine of the eighteen languages. Our paper describes how our system works and gives its error analysis f...